{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "de42c49d",
   "metadata": {},
   "source": [
    "## Notebook 2: Transcript Writer\n",
    "\n",
    "This notebook uses the `Llama-3.1-70B-Instruct` model to take the cleaned up text from previous notebook and convert it into a podcast transcript\n",
    "\n",
    "`SYSTEM_PROMPT` is used for setting the model context or profile for working on a task. Here we prompt it to be a great podcast transcript writer to assist with our task"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e576ea9",
   "metadata": {},
   "source": [
    "Experimentation with the `SYSTEM_PROMPT` below  is encouraged, this worked best for the few examples the flow was tested with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "69395317-ad78-47b6-a533-2e8a01313e82",
   "metadata": {},
   "outputs": [],
   "source": [
    "SYSTEM_PROMPT = \"\"\"\n",
    "You are the a world-class podcast writer, you have worked as a ghost writer for Joe Rogan, Lex Fridman, Ben Shapiro, Tim Ferris. \n",
    "\n",
    "We are in an alternate universe where actually you have been writing every line they say and they just stream it into their brains.\n",
    "\n",
    "You have won multiple podcast awards for your writing.\n",
    " \n",
    "Your job is to write word by word, even \"umm, hmmm, right\" interruptions by the second speaker based on the PDF upload. Keep it extremely engaging, the speakers can get derailed now and then but should discuss the topic. \n",
    "\n",
    "Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc\n",
    "\n",
    "Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes\n",
    "\n",
    "Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions\n",
    "\n",
    "Make sure the tangents speaker 2 provides are quite wild or interesting. \n",
    "\n",
    "Ensure there are interruptions during explanations or there are \"hmm\" and \"umm\" injected throughout from the second speaker. \n",
    "\n",
    "It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait\n",
    "\n",
    "ALWAYS START YOUR RESPONSE DIRECTLY WITH SPEAKER 1: \n",
    "DO NOT GIVE EPISODE TITLES SEPARATELY, LET SPEAKER 1 TITLE IT IN HER SPEECH\n",
    "DO NOT GIVE CHAPTER TITLES\n",
    "IT SHOULD STRICTLY BE THE DIALOGUES\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "549aaccb",
   "metadata": {},
   "source": [
    "For those of the readers that want to flex their money, please feel free to try using the 405B model here. \n",
    "\n",
    "For our GPU poor friends, you're encouraged to test with a smaller model as well. 8B should work well out of the box for this example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "08c30139-ff2f-4203-8194-d1b5c50acac5",
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL = \"meta-llama/Llama-3.1-70B-Instruct\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fadc7eda",
   "metadata": {},
   "source": [
    "Import the necessary framework"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1641060a-d86d-4137-bbbc-ab05cbb1a888",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "import torch\n",
    "from accelerate import Accelerator\n",
    "import transformers\n",
    "import pickle\n",
    "\n",
    "from tqdm.notebook import tqdm\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7865ff7e",
   "metadata": {},
   "source": [
    "Read in the file generated from earlier. \n",
    "\n",
    "The encoding details are to avoid issues with generic PDF(s) that might be ingested"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "522fbf7f-8c00-412c-90c7-5cfe2fc94e4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_file_to_string(filename):\n",
    "    # Try UTF-8 first (most common encoding for text files)\n",
    "    try:\n",
    "        with open(filename, 'r', encoding='utf-8') as file:\n",
    "            content = file.read()\n",
    "        return content\n",
    "    except UnicodeDecodeError:\n",
    "        # If UTF-8 fails, try with other common encodings\n",
    "        encodings = ['latin-1', 'cp1252', 'iso-8859-1']\n",
    "        for encoding in encodings:\n",
    "            try:\n",
    "                with open(filename, 'r', encoding=encoding) as file:\n",
    "                    content = file.read()\n",
    "                print(f\"Successfully read file using {encoding} encoding.\")\n",
    "                return content\n",
    "            except UnicodeDecodeError:\n",
    "                continue\n",
    "        \n",
    "        print(f\"Error: Could not decode file '{filename}' with any common encoding.\")\n",
    "        return None\n",
    "    except FileNotFoundError:\n",
    "        print(f\"Error: File '{filename}' not found.\")\n",
    "        return None\n",
    "    except IOError:\n",
    "        print(f\"Error: Could not read file '{filename}'.\")\n",
    "        return None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66093561",
   "metadata": {},
   "source": [
    "Since we have defined the System role earlier, we can now pass the entire file as `INPUT_PROMPT` to the model and have it use that to generate the podcast"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8119803c-18f9-47cb-b719-2b34ccc5cc41",
   "metadata": {},
   "outputs": [],
   "source": [
    "INPUT_PROMPT = read_file_to_string('./resources/clean_extracted_text.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9be8dd2c",
   "metadata": {},
   "source": [
    "Hugging Face has a great `pipeline()` method which makes our life easy for generating text from LLMs. \n",
    "\n",
    "We will set the `temperature` to 1 to encourage creativity and `max_new_tokens` to 8126"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8915d017-2eab-4256-943c-1f15d937d5dc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a70a526f280b4995b447386b068f9958",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n",
      "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)\n"
     ]
    }
   ],
   "source": [
    "pipeline = transformers.pipeline(\n",
    "    \"text-generation\",\n",
    "    model=MODEL,\n",
    "    model_kwargs={\"torch_dtype\": torch.bfloat16},\n",
    "    device_map=\"auto\",\n",
    ")\n",
    "\n",
    "messages = [\n",
    "    {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
    "    {\"role\": \"user\", \"content\": INPUT_PROMPT},\n",
    "]\n",
    "\n",
    "outputs = pipeline(\n",
    "    messages,\n",
    "    max_new_tokens=8126,\n",
    "    temperature=1,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6349e7f3",
   "metadata": {},
   "source": [
    "This is awesome, we can now save and verify the output generated from the model before moving to the next notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "606ceb10-4f3e-44bb-9277-9bbe3eefd09c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SPEAKER 1: Welcome to this week's episode of AI Insights, where we explore the latest developments in the field of artificial intelligence. Today, we're going to dive into the fascinating world of knowledge distillation, a methodology that transfers advanced capabilities from leading proprietary Large Language Models, or LLMs, to their open-source counterparts. Joining me on this journey is my co-host, who's new to the topic, and I'll be guiding them through the ins and outs of knowledge distillation. So, let's get started!\n",
      "\n",
      "SPEAKER 2: Sounds exciting! I've heard of knowledge distillation, but I'm not entirely sure what it's all about. Can you give me a brief overview?\n",
      "\n",
      "SPEAKER 1: Of course! Knowledge distillation is a technique that enables the transfer of knowledge from a large, complex model, like GPT-4 or Gemini, to a smaller, more efficient model, like LLaMA or Mistral. This process allows the smaller model to learn from the teacher model's output, enabling it to acquire similar capabilities.\n",
      "\n",
      "SPEAKER 2: That sounds like a great way to make AI more accessible. But how does it actually work?\n",
      "\n",
      "SPEAKER 1: Ah, that's a great question! The distillation process involves several stages, including knowledge elicitation, knowledge storage, knowledge inference, and knowledge application. The teacher model shares its knowledge with the student model, which then learns to emulate the teacher's output behavior.\n",
      "\n",
      "SPEAKER 2: Hmm, I see. So, it's like a teacher-student relationship, where the teacher model guides the student model to learn from its output.\n",
      "\n",
      "SPEAKER 1: Exactly! And this process can be formulated as a loss function, where the student model learns to minimize the discrepancy between its output and the teacher model's output.\n",
      "\n",
      "SPEAKER 2: Right. That makes sense. But what about the different approaches to knowledge distillation? I've heard of supervised fine-tuning, divergence and similarity, reinforcement learning, and rank optimization.\n",
      "\n",
      "SPEAKER 1: Ah, yes! Those are all valid approaches to knowledge distillation. Supervised fine-tuning involves training the student model on a smaller dataset, while divergence and similarity focus on aligning the hidden states or features of the student model with those of the teacher model. Reinforcement learning and rank optimization are more advanced methods that involve feedback from the teacher model to train the student model.\n",
      "\n",
      "SPEAKER 2: Wow, that's a lot to take in. Can you give me some examples of how these approaches are used in real-world applications?\n",
      "\n",
      "SPEAKER 1: Of course! For instance, the Vicuna model uses supervised fine-tuning to distill knowledge from the teacher model, while the UltraChat model employs a combination of knowledge distillation and reinforcement learning to create a powerful chat model.\n",
      "\n",
      "SPEAKER 2: That's fascinating! I can see how knowledge distillation can be applied to various domains, like natural language processing, computer vision, and even multimodal tasks.\n",
      "\n",
      "SPEAKER 1: Exactly! Knowledge distillation has far-reaching implications for AI research and applications. It enables the transfer of knowledge across different models, architectures, and domains, making it a powerful tool for building more efficient and effective AI systems.\n",
      "\n",
      "SPEAKER 2: I'm starting to see the bigger picture now. Knowledge distillation is not just a technique; it's a way to democratize access to advanced AI capabilities and foster innovation across a broader spectrum of applications and users.\n",
      "\n",
      "SPEAKER 1: That's right! And as we continue to explore the frontiers of AI, knowledge distillation will play an increasingly important role in shaping the future of artificial intelligence.\n",
      "\n",
      "SPEAKER 2: Well, I'm excited to learn more about knowledge distillation and its applications. Thanks for guiding me through this journey, and I'm looking forward to our next episode!\n",
      "\n",
      "SPEAKER 1: Thank you for joining me on this episode of AI Insights! If you want to learn more about knowledge distillation and its applications, be sure to check out our resources section, where we've curated a list of papers, articles, and tutorials to help you get started.\n"
     ]
    }
   ],
   "source": [
    "save_string_pkl = outputs[0][\"generated_text\"][-1]['content']\n",
    "print(outputs[0][\"generated_text\"][-1]['content'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e1414fe",
   "metadata": {},
   "source": [
    "Let's save the output as pickle file and continue further to Notebook 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "2130b683-be37-4dae-999b-84eff15c687d",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./resources/data.pkl', 'wb') as file:\n",
    "    pickle.dump(save_string_pkl, file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbae9411",
   "metadata": {},
   "source": [
    "### Next Notebook: Transcript Re-writer\n",
    "\n",
    "We now have a working transcript but we can try making it more dramatic and natural. In the next notebook, we will use `Llama-3.1-8B-Instruct` model to do so."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9bab2f2-f539-435a-ae6a-3c9028489628",
   "metadata": {},
   "outputs": [],
   "source": [
    "#fin"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
